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Abstract 

Variance components estimation and mixed model analysis are central themes in 
statistics with applications in numerous scientific disciplines. Despite the best efforts 
of generations of statisticians and numerical analysts, maximum likelihood estimation 
and restricted maximum likelihood estimation of variance component models remain 
numerically challenging. Building on the minorization-maximization (MM) principle, 
this paper presents a novel iterative algorithm for variance components estimation. 
MM algorithm is trivial to implement and competitive on large data problems. The 
algorithm readily extends to more complicated problems such as linear mixed models, 
multivariate response models possibly with missing data, maximum a posteriori estima¬ 
tion, penalized estimation, and generalized estimating equations (GEE). We establish 
the global convergence of the MM algorithm to a KKT point and demonstrate, both 
numerically and theoretically, that it converges faster than the classical EM algorithm 
when the number of variance components is greater than two and all covariance matrices 
are positive definite. 

Keywords: generalized estimating equations (GEE), global convergence, matrix 
convexity, linear mixed model (LMM), maximum a posteriori (MAP) estimation, max¬ 
imum likelihood estimation (MLE), minorization-maximization (MM), missing data, 
multivariate response, penalized estimation, restricted maximum likelihood (REML), 
variance components model 
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1 Introduction 


Variance components and linear mixed models are among the most potent tools in a statis¬ 
tician’s toolbox. They are essential topics in gradnate-level linear model conrses and the 


snbject of many cnrrent papers and research monographs (Rao and Kleffe, 1988, Searle et ah 


1992 

Rao 

1997, 

Khnri et ah 

1998 

Demidenko 

2013 


Their applications in agricnltnre, 
biology, economics, genetics, epidemiology, and medicine are too nnmerons to cover here in 


detail. The recommended books (Verbeke and Molenberghs, 2000, Weiss, 2005, Fitzmanrice 


et ah, 2011) stress longitndinal data analysis. 


Given an observed n x 1 response vector y and n x p predictor matrix X, the simplest 
variance components model postnlates that Y ~ N{Xf3, 17), where 




7=1 


and the Vi,, Vm are m fixed positive semidefinite matrices. The parameters of the model 
can be divided into mean effects (/3i,..., (3p) and variance components {af ,..., a^), snmma- 
rized by vectors (3 and cr^. Thronghont we assnme 17 is positive definite. The extension to 
singnlar 17 will not be pnrsned here. Estimation revolves aronnd the log-likelihood fnnction 


L(/3,cr") = -^Indetn-- X(3fQ,-\y - X(3). 


( 1 ) 


2 2 

Among the commonly nsed methods for estimating variance components, maximnm likeli¬ 


hood estimation (MLE) (Hartley and Rao, 1967) and restricted (or residnal) MLE (REML) 


(Harville, 1977) are the most popnlar. REML first projects y to the nnll space of X and 


then estimates variance components based on the projected responses. If the colnmns of the 
matrix B span the nnll space of then REML estimates the af by maximizing the log- 
likelihood of the redefined response vector B^Y, which is normally distribnted with mean 
0 and covariance B^CIB = o'fB'^ViB. 


There exists a large literatnre on iterative algorithms for finding MLE and REML (Laird 


and Ware, 1982[ Lindstrom and Bates 

1988, 1990| Harville and Callanan, 1990, Callanan 

and Harville, 

1991 

Bates and Pinheiro 

1998 

Schafer and Yncel 

2002 

). Fitting variance 


component models remains a challenge in models with a large sample size n or a large nnm- 


ber of variance components m. Newton’s method (Lindstrom and Bates, 1988) converges 


qnickly bnt is nnmerically nnstable owing to the non-concavity of the log-likelihood. Fisher’s 
scoring algorithm replaces the observed information matrix in Newton’s method by the ex¬ 
pected information matrix and yields an ascent algorithm when safegnarded by step halving. 
However the calcnlation and inversion of expected information matrices cost 0{mn^)+0{m^) 
flops for nnstrnctnred Vi and qnickly become impractical when either n or m is large. The 
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expectation-maximization (EM) algorithm initiated by Dempster et al. is a third alterna¬ 


tive ( 

Dempster et al. 

1977, Laird and Ware 

1982 

Laird et al. 

19871 

Lindstrom and Bates 

1988 

Bates and Pinheiro 

1998 

). Gompared to Newton’s method, the EM algorithm is easy 


te implement and numerically stable, but painfully slow to converge. In practice, a strat¬ 
egy of priming Newton’s method by a few EM steps leverages the stability of EM and the 
faster convergence of second-order methods. Quasi-Newton methods dispense with explicit 
calculation of the observed information while achieving a superlinear rate of convergence. 

In this paper we derive a minorization-maximization (MM) algorithm for finding the 
MLE and REML estimates of variance components. We prove global convergence of the 
MM algorithm to a Karush-Kuhn-Tucker (KKT) point and explain why MM generally con¬ 
verges faster than EM for models with more than two variance components. We also sketch 
extensions of the MM algorithm to the multivariate response model with possibly missing 
responses, the linear mixed model (LMM), maximum a posteriori (MAP) estimation, pe¬ 
nalized estimation, and generalized estimating equations (GEE). The numerical efficiency of 
the MM algorithm is illustrated through simulated data sets and a genomic example with 
more than 200 variance components. 


2 Preliminaries 


Background on MM algorithms 


Throughout we reserve Greek letters for parameters and indicate the current iteration num¬ 
ber by a superscript t. The MM principle for maximizing an objective function f(0) involves 
minorizing the objective function f{6) by a surrogate function g{6 \ 0^^^) around the current 
iterate 0*-*^ of a search (Lange et ah, 2000). Minorization is defined by the two conditions 


/(e'‘|) = (2) 

/(«) > 9(e|«“’). 


In other words, the surface 6 i— )■ g{6 \ 0 ^*^) lies below the surface 6 i—)■ f{6) and is tangent 
to it at the point 6 = 0^^\ Gonstruction of the minorizing function g{6 \ 6^*^) constitutes 
the first M of the MM algorithm. The second M of the algorithm maximizes the surrogate 
g{0 I 0W) rather than f{6). The point maximizing g{6 \ 0^*^) satisfies the ascent 

property This fact follows from the inequalities 


/(0(*+i)) > I 0W) > ^(0W I 0(0) = /(0(*)), 


( 3 ) 


reflecting the definition of and the tangency and domination conditions (j^. The 

ascent property makes the MM algorithm remarkably stable. The validity of the descent 
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property depends only on increasing g{6 \ not on maximizing g{6 \ 6^^'^). With obvious 
changes, the MM algorithm also applies to minimization rather than to maximization. To 
minimize a function /(0), we majorize it by a surrogate function g{0 \ and minimize 
g{6 I 0^*^) to produce the next iterate 0*-*^^^ The acronym should not be confused with the 
maximization-maximization algorithm in the variational Bayes context (Jeon, 2012). 

The MM principle (|De Leeuw 1994, Heiser, 1995, Kiers, 2002, Lange et al., 2000, Hunter 


and Lange, 2004; Wu and Lange, 2010) hnds applications in multidimensional scaling (Borg 


and Groenen, |2005), ranking of sports teams (Hunter, 2004), variable selection (Hunter and 


Li, 2005), optimal experiment design (Yu, 2010), multivariate statistics (Zhou and Lange 


2010), geometric programming (Lange and Zhou, 2014), and many other areas (|Lange 2010 


Chapter 12). The celebrated EM principle (Dempster et ah, 1977) is a special case of 
the MM principle. The Q function produced in the E step of an EM algorithm minorizes 
the log-likelihood up to an irrelevant constant. Thus, both EM and MM share the same 
advantages: simplicity, stability, graceful adaptation to constraints, and the tendency to 
avoid large matrix inversion. The more general MM perspective frees algorithm derivation 
from the missing data straitjacket and invites wider applications (Wu and Lange, 2010). 
Figure [T] shows the minorization functions of EM and MM for a variance components model 
with m = 2 variance components. 

EM and MM algorithms often exhibit slow convergence. Fortunately, this defect can 
be remedied by off-the-shelf acceleration techniques for hxed point iterations. The recently 
developed squared iterative method (SQUAREM) (Varadhan and Roland, 2008) and the 
quasi-Newton acceleration method ( Zhou et akf 2011[ ) are particularly attractive, given their 
simplicity and minimal memory and computational costs. Our numerical experiments feature 
the unadorned MM algorithm and the quasi-Newton accelerated MM (aMM) algorithm based 
on one secant pair. Using more secant pairs is likely to further improve performance. 


Convex matrix functions 

For symmetric matrices we write A ^ B when B — A is positive semidehnite and A -< B 
if .B — A is positive dehnite. A matrix-valued function / is said to be (matrix) convex if 

/|A^ + (l_A)S] :< a/(A) + (1 - A)/(S) 

for all A, B, and A G [0,1]. Our derivation of the MM variance components algorithm hinges 
on the convexity of the two functions mentioned in the next lemma. 

Lemma 1. (a) The matrix fractional function f{A,B) = A^B~^A is jointly convex in the 
mxn matrix A and the mxm positive definite matrix B. (b) The log determinant function 
f{B) = IndetB is concave on the set of positive definite matrices. 
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Figure 1; Log-likelihood surface of a 2-variance component model and the surrogate functions 
of EM and MM minorizing the objective function at point = (18.5,0.7). 
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Proof. The matrix fractional function is matrix convex because its epigraph 


{{A,B,C) : B yO,A^B^^A^C} = { {A, B,C) : B y 0, 


B A 
A'^ C 


P 0 


is a convex set. Here C varies over the set of n x n positive semidefinite matrices. The 


equivalence of these two epigraph representations is proved in Boyd and Vandenberghe (2004 


A.5.5). For the concavity of the log determinant, see Boyd and Vandenberghe (2004, p74). 

□ 


3 Univariate response model 


Our strategy for maximizing the log-likelihood ([^ is to alternate updating the mean param¬ 
eters (3 and the variance components cr^. Updating (3 given cr^ is a standard general least 
squares problem with solution 

Updating cr^ given depends on two minorizations. If we assume that all of the Vi are 
positive definite, then the joint convexity of the map (X,V) i—>■ X"^Y~^X for positive 
definite Y implies that 

n 


= I af‘'v ,) ( ^ 4“T 


2(t). 


. 2 = 1 
m 

v 


. 2=1 


. 2=1 


O’, 


2(0 


=1 Uj '^3 

m 2(t) 

O’/ 


2(t) 

^2(0^ 


m 2{t) 


2(0 ^ 2(0 
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2(0 




(J, 


2(0 


2(0 


O',- 


-1 


2(t) 

SiE _ 2(0^ 


O', 


2(0 


Eywn 


2=1 


m 4(t) 

E^v. 


i=l 


a. 


2 


When one or more of the Vj are rank deficient, we replace each V by Vj ^ = Vi + el for e > 0 
small. Sending e to 0 in the just proved majorization 


m 4(0 
0-,- 


^ (y~ 


i=l 
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gives the desired majorization 






(Ta 


m 

V- 

2 


cr; 


in the general case. Negating both sides leads to the minorization 

/ ^ 4(t) \ 

-{y-XI3fn-\y-Xl3) t -{y-X/}fn->‘''\T^vAsi-'‘Hy-X/}) (5) 

that effectively separates the variance components af,..., cr^ in the quadratic term of the 
log-likelihood ([^. 

The convexity of the function An-)- — log det A is equivalent to the supporting hyperplane 
minorization 


-Indetfl > -IndetriW-tr[f2-W(ri - riW)] 


( 6 ) 


that separates af ,..., a'^ in the log determinant term of the log-likelihood Q. Combination 
of the minorizations ([^ and ([^ gives the overall minorization 


I o- 


2{t)A 


1 1 f 4(i) \ 

--tr(f2-Wf2) - -{y - f ^ -^vA + 


,(t) 


( 7 ) 


E 

2=1 


2 1 


4(i) 


tr(ri-W\^) --^(y - - Xf3 


(t)^ 


2 2 ( 7.2 


+ cW, 


where is an irrelevant constant. Maximization of g{c 
the lovely multiplicative update 

2(q 


with respect to af yields 


2(t+l) 

(Ta = <Ja 


{y - Xf5^^'^Yn-^*^Vin-^*\y - Xf3 


it) A 


tr(f2-«l^) 


i = 1,..., m. 


( 8 ) 


To preserve the uniqueness and continuity of the algorithm map, we must take = 0 

0(f\ 

whenever crd = 0. As a sanity check on our derivation, consider the partial derivative 

Given a^'' > 0, it is clear from the update formula ([8) that when < 0. 

Conversely when > 0. Algorithm a summarizes the MM algorithm for 

MLE of the univariate response model ([^. 
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Input : t/, X, Vi,...,Kn 
Output; MLE (jf,..., (T^ 


1 Initialize a. 

2 repeat 


( 0 ) 


> 0, i = 1,..., m 


n"' ^ Efc. ■jf’K 

^ argmin^ {y — Xf3)^fl^^^\y — Xfi) 

2{t+l) 


a. 


^ a. 


m 


5 until objective value converges 


z = 1,..., m 


Algorithm 1: MM algorithm for MLE of the variance components of model Q. 


The update formula (|^ assumes that the numerator under the square root sign is non¬ 
negative and the denominator is positive. The numerator requirement is a consequence of 
the positive semidefiniteness of V^. The denominator requirement can be verified through 
the Hadamard (elementwise) product representation tr(r2“*'*^Vi) = O T^)l. The 

following lemma of Schur (1911) is crucial. We give a self-contained probabilistic proof. 


Lemma 2 (Schur). The Hadamard product of a positive definite matrix with a positive 
semidefinite matrix with positive diagonal entries is positive definite. 


Proof. Let X = (Xi,..., X„)^ be a random normal vector with mean 0 and positive definite 
covariance matrix A. Let Y = (Yi,... be a random normal vector independent of X 

with mean 0 and positive semidefinite covariance matrix B having positive diagonal entries. 
Then Z = X QY has covariances Pj{ZiZj) = E(Xjl^XjYj) = Pj^XiXfiPjiYiYj) = Oijbij. It 
follows that Cov(Z’) = AQ B. To show AQ B is positive definite, suppose on the contrary 
that v'^{A 0 B)v = Var(u'^Z) = 0 for some u ^ 0. Then 


0 = Var(u" Z) = E( V nXiYi ) = E 






'^i.XjYi 


= E[(u 0 Y)^A(u 0 Y)] 


implies u 0 Y = 0 with probability 1. Since u 7 ^ 0, Yj = 0 with probability 1 for some i. 
This contradicts the assumption bn = Var(Yj) > 0 for all i. □ 


We can now obtain the following characterization of the MM iterates. 

Proposition 1. Assume V) has strictly positive diagonal entries. Then tr(f2“*'*^V)) > 0 for 
all t. Furthermore > 0 and ^ null{Vi) for all t, then > 0 for 

all t. When V) is positive definite, > 0 holds if and only if y X(3^^\ 


Proof. The first claim follows easily from Schur’s lemma. The second claim follows by 
induction. The third claim follows from the observation that null(Vi) = {0}. □ 










In most applications, Vm = I- Proposition [^guarantees that if cxm^^ > 0 and the residual 
vector y — is nonzero, then remains positive and thus remains positive 

definite throughout all iterations. This fact does not prevent any of the sequences ^ 
horn, converging to 0. In this sense, the MM algorithm acts like an interior point method, 
approaching the optimum from inside the feasible region. 

Univariate response: two variance components 


Input ■.y,X,Vi,V 2 
Output: MLE 

1 Simultaneous congruence decomposition; {D,U) (Vi, U) 

2 Transform data: y ^ U^y,X U^X 

3 Initialize a® > 0 

4 repeat 




n 


^ argmin^ Y.l=iwt\yi - xf (3) 

2(t+l) 2{t) 


a 


a 




r'l-lO-, 




tr[((T^^‘^r>+(72<‘T)-ir>] 


(To 




tr[(f72^‘)r>+f72^'T)-l] 

8 until objective value converges 

Algorithm 2: Simplihed MM algorithm for MLE of model Q with m = 2 variance 
components and fl = afVi + a 2 V 2 . 

The major computational cost of Algorithm 1 is inversion of the covariance matrix 
at each iteration. Problem specific structures such as block diagonal matrices or a diagonal 
matrix plus a low-rank matrix are often exploited to speed up matrix inversion. The special 
case of m = 2 variance components deserves attention as repeated matrix inversion can be 
avoided by invoking the simultaneous congruence decomposition for two symmetric matrices. 


2 (‘) i 


one of which is positive definite ( 

Rao 

1973 

Horn and Johnson, 

1985 

). This decomposition 

is also called the generalized eigenvalue decomposition ( 

Golub and Van Loan, 1996, Boyd 


and Vandenberghe, 2004). If one assumes f2 = afVi + cr^U and lets (Vi, U) {D, U) be 


the decomposition with U nonsingular, U"^ViU = D diagonal, and = /, then 


ndi 

n-di 

det{ndl) 


= u 


-T(2(t) 


1 


a 

2(0 


D + 

. 2 ( 0 : 


In)U 


-1 


T 


det{al^*'> D + In) det{U 


-T 


u 


( 10 ) 


= det((Ji^*^iI> + (72''''’/^) detCU)- 


2(0 
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With the revised responses y = U^y and the revised predictor matrix X = U^X, the 
npdate ([^ reqnires only vector operations and costs 0{n) flops. Updating the flxed effects is 
a weighted least sqnares problem with the transformed data {y,X) and observation weights 
+ (72*'*^)“^. Algorithm I snmmarizes the simplified MM algorithm for two 
variance components. 

Numerical experiments 

This section compares the nnmerical performance of MM, qnasi-Newton accelerated MM, 
EM, and Fisher scoring on simnlated data from a two-way ANOVA random effects model 
and a genetic model. For ease of comparison, all algorithm rnns start from = 1 and 
terminate when the relative change the log-likelihood is less 

than 10“®. 

Two-way ANOVA: We simnlated data from a two-way ANOVA random effects model 


Uijk — y + oti -\r (3j + -I- 6^^^, 


where Oj ~ A(0,(T^), I3j ~ A(0 ,ct|), {a(3)ij ~ A( 0 ,(T 3 ), and ~ N{0,al) are jointly 
independent. This corresponds to m = 4 variance components. In the simnlation, we set 
cr| = cr| = (jg and varied the ratio a\/a\\ the nnmbers of levels a and b in factor 1 and factor 
2, respectively; and the nnmber of observations c in each combination of factor levels. For 
each simnlation scenario, we simnlated 50 replicates. The sample size was n = abc for each 
replicate. 

Tables [T] and show the average nnmber of iterations and the average rnntimes when 
there are a = b = 5 levels of each factor. Based on these resnlts and fnrther resnlts not 
shown for other combinations of a and b, we draw the following conclnsions. Fisher scoring 
takes the fewest iterations. The MM algorithm always takes fewer iterations than the EM 
algorithm. Accelerated MM fnrther improves the convergence rate of MM. The faster rate 
of convergence of Fisher scoring is ontweighed by the extra cost of evalnating and inverting 
the covariance matrix. When the sample size n = abc is large, Fisher scoring takes mnch 
longer than either EM or MM. 

Genetic model: We simnlated a qnantitative trait y from a genetic model with two 
variance components and covariance matrix f2 = -|- cjg/, where $ is a fnll-rank empiri¬ 

cal kinship matrix estimated from the genome-wide measnrements of 212 individnals nsing 


Option 29 of the Mendel software (Lange et al., 2013, 2005). In this example, Fisher scoring 


excels at smaller cr^/crf ratios, while accelerated MM is fastest at larger ul/al ratios. 

In snmmary, the MM algorithm appears competitive even in small-scale examples. Mod¬ 
ern applications often involve a large nnmber of variance components. In this setting, the 
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EM algorithm suffers from slow convergence and Fisher scoring from an extremely high cost 
per iteration. Our genomic example in Section reinforces this point. 


4 Global convergence of the MM algorithm 


The KKT necessary conditions for a local maximum cr^ = {af,..., cr^) of the log-likelihood 
Q require each component of the score vector to satisfy 

s , /{O} 

|^(—oo,0] af = 0 . 

In this section we establish the global convergence of Algorithmic to a KKT point. To reduce 
the notational burden, we assume that X is null and omit estimation of fixed effects (3. The 
analysis easily extends to the MLE case. Our convergence analysis relies on characterizing the 
properties of the objective function T(cr^) and the MM algorithmic mapping cr^ i—)■ M(cr^) 
defined by equation (|C). Special attention must be paid to the boundary values af = 0. We 
prove convergences for two cases, which cover most applications. 


Assumption 1. All Vj are positive definite. 


Assumption 2. Vi is positive definite, each Vj is nontrivial, l-L = span{V 2 ,..., VA} has 
dimension q < n, and y ^T-L. 


The genetic model in Section satisfies Assumption 1, while the two-way ANOVA model 
satisfies Assumption 2. The key condition y ^ span{V 2 ,..., Vm} in the second case is critical 


for the existence of an MLE or an REML (Demidenko and Massam, 1999, Grzqdziel and 


Michalski, 2014). We will derive a sequence of lemmas en route to the global convergence 


result declared in Theorem [TJ 


Lemma 3. Under Assumption 1 or 2, the log-likelihood function Q is coercive in the sense 
that the super-level set Sc = {cr‘^ > 0 : L{cr‘^) > c} is compact for every c. 


Proof. Let us first prove the assertion when all of the covariance matrices Vj are positive 
definite. If we set r = ||cr^||i and Oj = r~^af for each i, then the log-likelihood satisfies 




lit 

-^Inr - -Indet 

i=l 



IIL ^ 

aiV^ y. 


The functions Indet y of a are defined and continuous 

on the unit simplex and hence bounded there. The dominant term — | In r of the loglikelihood 
tends to —oo as r tends to oo. 
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(y\la\ Method c = ^ observations per combination 




2 

8 

20 

50 

0 

MM 

34.52(15.79) 

25.90(8.69) 

18.62(7.22) 

15.48(5.34) 


aMM 

16.68(5.69) 

13.48(3.47) 

11.76(3.12) 

10.88(2.43) 


EM 

123.70(63.72) 

61.58(31.36) 

38.44(18.58) 

25.66(10.31) 


FS 

6.10(1.09) 

6.74(0.99) 

6.68(0.79) 

6.36(0.72) 

0.05 

MM 

27.78(13.05) 

22.82(8.96) 

19.82(6.55) 

15.48(3.97) 


aMM 

14.80(4.33) 

12.32(3.27) 

12.08(2.62) 

11.20(2.52) 


EM 

108.04(62.58) 

58.42(33.67) 

43.52(19.48) 

27.62(12.47) 


FS 

6.20(1.29) 

6.72(1.25) 

6.62(0.73) 

6.60(1.07) 

0.1 

MM 

31.26(14.90) 

23.38(9.21) 

16.84(6.72) 

14.88(4.56) 


aMM 

15.96(5.65) 

12.72(3.59) 

10.36(2.51) 

10.80(2.46) 


EM 

112.12(72.70) 

62.26(28.87) 

34.86(22.61) 

24.10(11.96) 


FS 

6.10(1.25) 

6.90(0.79) 

6.48(0.86) 

6.52(0.86) 

1 

MM 

29.72(15.85) 

22.72(10.86) 

17.78(8.18) 

13.94(4.73) 


aMM 

15.24(5.60) 

12.40(4.12) 

10.72(2.70) 

10.24(2.01) 


EM 

85.86(63.85) 

41.50(30.46) 

28.40(20.02) 

21.36(13.86) 


FS 

5.96(1.19) 

6.90(0.91) 

6.36(1.05) 

6.44(0.93) 

10 

MM 

16.46(9.74) 

13.28(7.75) 

12.80(6.41) 

10.74(3.67) 


aMM 

11.60(3.70) 

9.36(2.78) 

9.04(3.00) 

8.68(2.54) 


EM 

24.50(32.87) 

16.18(23.06) 

15.10(16.55) 

12.36(11.13) 


FS 

6.98(0.80) 

6.96(0.70) 

6.74(0.83) 

6.76(0.52) 

20 

MM 

17.34(10.70) 

14.20(6.79) 

11.58(4.46) 

10.16(4.26) 


aMM 

12.12(5.96) 

9.92(2.68) 

8.92(2.07) 

8.48(2.00) 


EM 

31.08(42.11) 

20.50(24.55) 

10.84(10.86) 

8.98(8.94) 


FS 

7.18(0.98) 

7.02(0.82) 

6.90(0.74) 

6.78(0.79) 


Table 1: Average iterations until convergence for MM, quasi-Newton accelerated MM (aMM), 
EM, and Fisher scoring (FS) for htting a two-way ANOVA model with a = b = 5 levels of 
both factors. Standard errors are given in parentheses. 
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Method 


c = ^ observations per combination 

2 

8 

20 

50 

0 

MM 

30.50(81.58) 

132.50(48.51) 

739.80(272.66) 

4004.10(1317.90) 


aMM 

16.76(36.57) 

92.06(34.42) 

638.76(231.86) 

3691.90(873.70) 


EM 

70.76(43.58) 

376.82(184.27) 

1912.30(918.56) 

8276.98(3269.26) 


FS 

17.52(27.99) 

241.06(51.79) 

4039.73(6955.91) 

43315.42(76563.64) 

0.05 

MM 

16.79(11.73) 

117.33(50.33) 

867.41(296.24) 

4083.60(1016.71) 


aMM 

14.23(14.21) 

80.93(31.83) 

692.54(196.19) 

3841.10(948.55) 


EM 

66.73(44.87) 

376.78(206.19) 

2291.69(1035.80) 

9054.02(3989.52) 


FS 

13.33(18.33) 

253.10(61.72) 

3198.17(379.00) 

34057.87(7132.36) 

0.1 

MM 

17.01(8.97) 

122.03(55.77) 

733.37(329.08) 

3992.52(1166.67) 


aMM 

12.39(8.86) 

88.34(33.87) 

593.27(174.93) 

3745.50(922.99) 


EM 

76.65(53.37) 

389.45(179.41) 

1814.91(1152.64) 

7951.40(3810.34) 


FS 

10.24(8.57) 

257.63(45.53) 

3140.15(481.08) 

33490.67(4533.51) 

1 

MM 

16.50(13.08) 

112.94(52.73) 

736.32(322.26) 

3746.87(1221.92) 


aMM 

9.86(4.10) 

80.98(39.49) 

585.98(158.65) 

3536.90(754.48) 


EM 

56.93(48.16) 

267.86(194.49) 

1465.45(986.31) 

7079.60(4430.10) 


FS 

15.75(17.26) 

262.49(44.28) 

3003.68(481.92) 

33215.82(4801.38) 

10 

MM 

10.80(11.16) 

70.94(47.94) 

545.71(256.97) 

2316.96(1022.51) 


aMM 

8.64(4.17) 

62.50(26.13) 

483.63(183.47) 

2317.43(1061.57) 


EM 

21.51(31.61) 

113.76(158.82) 

803.36(816.05) 

3256.65(2624.29) 


FS 

12.32(9.81) 

261.85(37.84) 

3190.52(394.75) 

26163.81(8451.96) 

20 

MM 

8.83(5.05) 

104.94(54.66) 

552.13(190.42) 

1706.71(680.84) 


aMM 

9.57(9.80) 

92.94(35.84) 

524.70(137.22) 

1750.99(489.96) 


EM 

23.13(31.17) 

175.12(198.18) 

642.39(576.82) 

2007.86(1901.66) 


FS 

12.71(11.90) 

340.81(48.29) 

3543.18(464.36) 

18796.59(2445.74) 


Table 2; Average run times (xlO“^ seconds) of MM, quasi-Newton accelerated MM (aMM), 
EM, and Fisher scoring (FS) for fitting a two-way ANOVA model with a = b = 5 levels of 
both factors. Standard errors are given in parentheses. 
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Method 

Iteration 

Runtime (10 ^ sec) 

Objective 

0 

MM 

88.10(29.01) 

778.24(305.37) 

-374.35(9.82) 


aMM 

23.65(5.74) 

293.16(146.23) 

-374.34(9.82) 


EM 

231.93(123.39) 

3509.02(1851.11) 

-374.41(9.83) 


FS 

5.05(1.24) 

137.76(65.74) 

-374.36(9.83) 

0.05 

MM 

84.97(31.18) 

710.56(260.24) 

-377.19(10.85) 


aMM 

23.05(5.45) 

272.04(67.01) 

-377.18(10.85) 


EM 

220.57(124.70) 

3292.87(1865.91) 

-377.25(10.85) 


FS 

5.08(1.21) 

136.47(33.18) 

-377.21(10.83) 

0.1 

MM 

82.45(34.39) 

673.96(268.23) 

-379.62(10.54) 


aMM 

22.55(6.01) 

269.55(86.69) 

-379.61(10.54) 


EM 

199.70(113.47) 

2917.71(1607.33) 

-379.68(10.54) 


FS 

4.97(1.03) 

129.71(40.51) 

-379.62(10.54) 

1 

MM 

31.00(15.59) 

160.21(80.45) 

-409.66(11.26) 


aMM 

12.55(5.38) 

90.21(43.54) 

-409.66(11.26) 


EM 

51.10(28.70) 

550.55(321.89) 

-409.67(11.26) 


FS 

4.60(0.59) 

80.28(25.56) 

-409.66(11.26) 

10 

MM 

72.67(39.23) 

374.80(209.31) 

-532.57(9.11) 


aMM 

20.15(10.18) 

146.25(81.06) 

-531.24(9.28) 


EM 

294.20(717.05) 

3079.82(7520.30) 

-532.71(9.11) 


FS 

10.18(4.92) 

168.63(80.34) 

-532.08(9.21) 

20 

MM 

78.35(34.32) 

425.40(188.08) 

-591.36(7.05) 


aMM 

14.80(6.53) 

117.14(71.14) 

-589.13(7.15) 


EM 

362.07(764.60) 

4144.92(8862.65) 

-591.62(6.82) 


FS 

10.93(4.75) 

181.48(83.96) 

-590.68(7.08) 


Table 3: Average performance of MM, quasi-Newton accelerated MM (aMM), EM, and 
Fisher scoring (FS) for fitting a genetic model. Standard errors are given in parentheses. 
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To prove the assertion under Assumption 2, consider first the case Vi = In- Setting 
ai = erf /al ioT i = 2,m reduces the loglikelihood to 


^ m 

L{al,a) = --Indet (in+ '^aiVi 


2 " 2 
The middle term on the right satisfies 


i=2 


2al 


+ y. ( 11 ) 


i=2 


- - In det (in + ^ aiVi) < 0 


i=2 


because det (/„ + — det In = 1- Now let U = (Uq, Un-q) be an n X n orthogonal 

matrix whose left columns Uq span H and whose right columns Un-q span The identity 




I, + EI2 a.U^V,U, 0 


1=2 


0 


'-n—q. 


follows from the orthogonality relations = Uff_gUq = Q(n-q)xn- This in turn implies 


In T ^ ^ OiiVi 


-1 


i=2 


= U 


y U 


n-q'-'q ^{n-q)> 

Q 


{Iq + TZ2 O^iU^V^q 


0 0 


0 I 


0 

u^ 


u^ 


^n—q 


n—q . 


= u u’^ 

^n-q^n-q' 


Therefore the quadratic term in equation ([II]) is bounded below by the positive constant 


y 


In + '^aiV^) y > y'^Un-qUn_qy = > 0. 


i=2 


Here the assumption y guarantees the projection property 7 ^ 0. 

Next we show that the loglikelihood tends to —cx) when af tends to 0 or 00 or when ||q :||2 
tends to 00 . The second of the two inequalities 


L{o-1,ol) < -^IncTi --Indet [/„ + - ^||P^ 


< -Jlnai^ - ^|| P ^ x 2/"2 


i=2 


-y\ 


2 " 2ai 

renders the claim about af obvious. To prove the claim about ck, we make the worst case 
choice af = ||P-^x^||^ in the first inequality. It follows that 


L{(tI,cx) < --Indet + - ^ln||P^ 


-2/f- 


n 


i=2 
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If aj tends to cx), then the inequality 


lib ^ I L 

--In det (in + ^ oiiV(^ < In det + ajV^ = ln(l + a^X^k) 

i=2 k=l 

holds, where the Xjk are the eigenvalues of Vj. At least one of these eigenvalues is positive 
because Vj is nontrivial. It follows that h((jQ, ck) tends to —oo in this case as well. 

For the general case where Vi is non-singular but not necessarily /„, let Vj ' be the 
symmetric square root of Vi and write 

m / m \ 

n + ^ <j1v, = vP" (^ + ) vi'\ 

i=2 \ i=2 J 

The above arguments still apply since each ' ViVi ' is nontrivial and y belongs to the 
span{V 2 ,..., Vm\ = >5 if and only if belongs to □ 

Lemma 4. The iterates possess the ascent property Furthermore, 

when L{M{(tI)) = L{(j‘l), cri fulfills the fixed point condition M(cr^) = ul, and each com¬ 
ponent satisfies either (i) aX = 0 or (ii) > 0 and ^L^crl) = 0. 

Proof. The ascent property is built into any MM algorithm. Suppose L(M(cr^)) = L{crl) at 
a point cr^ G M™. Then equality must hold in the string of inequalities ([^. It follows that 


and hence that M{crl) = crl. If > 0, the stationarity condition 


d 

daf 






0 


applies. The equivalence of the two displayed partial derivatives is a consequence of the fact 
that the difference /(cr^) — g{cr‘^ \ cr^) achieves its minimum of 0 at cr^ = cr^. □ 

Lemma 5. The distance between successive iterates ||cr^h+i) _ ^2(t)||^ converges to 0. 

Proof. Suppose on the contrary that ||cr^h+i) _^ 2 (t)||^ does not converge to 0. Then one can 
extract a subsequence {tk}k>i such that 

||cr2hfc+i) _ cr2hfe)||2 > e > 0 (12) 


for all k. Let Cq be the compact super-level set {cr^ : Since the sequence 

{cr^(*'')}fc>i is conhned to Cq, one can pass to a subsequence if necessary and assume that 
cr^hfc) converges to a limit cr^ and that cr^hj=+i) converges to a limit cr^^. Taking limits in the 
relation = M(cr^hfc)) ^nd invoking the continuity M(cr^) imply that = M(cr^). 
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Because the sequence L(cr^hfe)) jg monotonically increasing in k and bounded above on Cq, 
it converges to a limit L*. Hence, the continuity of L{cr'^) implies 


L(cr2) = limL(cr2(*'')) = L* = limL(cr2(*'“+^)) = ^(cr^J = L{M{(tI)). 


Lemma therefore gives = cr^, contradicting the bound \\(t\ 


|2 ^ 


> e 


entailed by inequality (12). 


□ 


Theorem 1. The MM sequence {cr^h) j^as at least one limit point. Every limit point is 
a fixed point of If the set of fixed points is discrete, then the MM sequence converges 

to one of them. Finally, when the iterates converge, their limit is a KKT point. 

Proof. The sequence is contained in the super-level compact set Cq dehned in 

Lemma and therefore admits a convergent subsequence -^vith limit As argued 

in Lemma|^ L(cr^(“)) = L(M(cr^(°°))). Lemma 0 now implies that is a hxed point of 

the algorithm map M(cr^). 


According to Ostrowski’s theorem (Lange, 2010, Proposition 8.2.1), the set of limit points 
of a bounded sequence {cr^h) jg connected and compact provided ||cr^h+i) _ cr2h)||2 — 0 . 
If the set of hxed points is discrete, then the connected subset of limit points reduces to a 
single point. Hence, the bounded sequence cr^h) converges to this point. When the limit 
exists, one can check that satishes the KKT conditions by proving that each zero 

component of has a non-positive partial derivative. Suppose on the contrary = 0 
and > 0. By continuity > 0 for all large t. Therefore, 


> cr, 


2(t) 


for all large t by the observation made after equation (|^. This behavior is inconsistent with 
the assumption that afi ^ 0. □ 


5 MM versus EM 

Examination of Tables and [^suggests that the MM algorithm usually converges faster than 
the EM algorithm. We now provide theoretical justihcation for this observation. Again for 
notational convenience, we consider the REML case where X is null. Since the EM principle 
is just a special instance of the MM principle, we can compare their convergence properties 
in a unihed framework. Consider an MM map M{6) for maximizing the objective function 
f{6) via the surrogate function g{6 \ 6^^^). Close to the optimal point 6°^, 

Q{t+1)_QOO _ (iM(0“)(6>W-0“), 

where dM{0°°) is the differential of the mapping M at the optimal point 0“ of f{0). Hence, 
the local convergence rate of the sequence coincides with the spectral 
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radius of dM{6°° 
demonstrate that 


Familiar calculations (McLachlan and Krishnan, 2008, Lange, 2010) 


dM{e^ 


[d^g{e^ I e^)]-^d^f{e^). 


In other words, the local convergence rate is determined by how well the surrogate surface 
g{6 I 0“) approximates the objective surface f{6) near the optimal point . In the EM 


literature, dM{6°°) is called the rate matrix (Meng and Rubin, 1991). Fast convergence 
occurs when the surrogate g{0 \ 0°°) hugs the objective f{0) tightly around 0°°. Figure 
shows a case where the MM surrogate locally dominates the EM surrogate. We demonstrate 
that this is no accident. 


McLachlan and Krishnan (2008) derive the EM surrogate 


5'EM(cr^ 




— - ^ [rank(Vj) Ina^^ + rank(T^) 


i=l 

m 


(J, 


2(0 4(i) 

-^2 - 

cr? 


- E 


a. 


m 

2 


i=l 


(J, 


minorizing the log-likelihood up to an irrelevant constant. Section S.3 of the Supplementary 
Materials gives a detailed derivation for the more general multivariate response case. The 
rank of the covariance matrix Vj appears because Vj may not be invertible. Both of the 
surrogates 5 'EM(cr^ | and \ are parameter separated. This implies 

that both second differentials and are diagonal. A 

small diagonal entry of either matrix indicates fast convergence of the corresponding variance 
component. Our next result shows that, under Assumption 1, on average the diagonal entries 
of dominate those of when m > 2. Thus, the EM 

algorithm tends to converge more slowly than the MM algorithm, and the difference is more 
pronounced as the number of variance components m grows. 

Theorem 2. Let 0^ be a common limit point of the EM and MM algorithms. Then 

both second differentials and d'^gare diagonal with 

rank(Vj) 




2(oo) 


2(oo) 






2 ^ 4 ( 00 ) 




2(oo) 


Furthermore, the average ratio 

1 ^ I 


(J, 


m 


2(oo) 


^ 9 I cr2(°°)) 

for m > 2 when all Vi have full rank n. 


= — > 


mn 


i=l 


-(oo) 2(oo) 

L' /J 


cr. 


= - 
m 


< 1 
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Proof. See Section ?? of the Supplementary Materials. 


□ 


Both the EM and MM algorithms must evaluate the traces tr(r2 and quadratic 

forms {y - at each iteration. Since these quantities are 

also the building blocks of the approximate rate matrices \ one can rationally 

choose either the EM or MM updates based on which has smaller diagonal entries measured 
by the £i, £ 2 , or £00 norms. At negligible extra cost, this produces a hybrid algorithm that 
retains the ascent property and enjoys the better of the two convergence rates. 


6 Extensions 

Besides its competitive numerical performance, Algorithm is attractive for its simplicity 
and ease of generalization. In this section, we outline MM algorithms for multivariate re¬ 
sponse models possibly with missing data, linear mixed models, MAP estimation, penalized 
estimation, and generalized estimating equations. 


6.1 Multivariate response model 

Consider the multivariate response model with nx d response matrix X, mean EX = XB, 
and covariance 

m 

n = Cov(vecX) = ^ri(8)Vi. 

i=l 


The px d coefficient matrix B collects the fixed effects, the Fj are unknown dx d covariance 
matrices, and the Vi are known n x n covariance matrices. If the vector vecX is normally 


distributed, then Y equals a sum of independent matrix normal distributions (Gupta and 


Nagar, 1999). We now make this assumption and pursue estimation of B and the Fj, which 


we collectively denote as F. Under the normality assumption, the Kronecker product identity 
vec{CDE) = (E'^ (g) C)vec{D) yields the log-likelihood 


L{B,r) = -^lYidetfl-^YeciX-XBffl-^vec(Y-XB) (13) 

X)vecBfCl-^[vecY - {la ® X)vec.B]. 

Updating B given F^*^ is accomplished by solving the general least squares problem met 
earlier in the univariate case. Maximization of the log-likelihood (13) is difficult due to the 
requirement that each Fj be positive semidefinite. Typical solutions involve reparameteri¬ 
zation of the covariance matrix (Pinheiro and Bates, 1996). The MM algorithm derived in 
this section gracefully accommodates the covariance constraints. 


= -^lndetf2 - ^[vecX - (Id 
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Updating F given reqnires generalizing the minorization ([^. In view of Lemma 
and the identities {A®B){C ® D) = (AC) ® (BD) and (A ® = A~^ ® B~ we have 




= m 


m 


Er 


(d 


2=1 

m 


m 




i=l 


-1 


m 


Er 

2 = 1 


(i) 


^ m-V(Ff ®L^)(F,®V,)-i(Ff ® V,) 

m ^^ 


—1 /"pld 


i=l 


^(r«r-‘r<‘>) 0 v, 


2=1 


or equivalently 




Vi 


2 = 1 




(14) 


This derivation relies on the invertibility of the matrices Vi. One can relax this assumption 
by substituting V^^i = Vi + eln for Vi and sending e to 0. 

The majorization (14) and the minorization ([^ jointly yield the surrogate 

-i m 

^(F |F(')) = --^|tr[Ll-«(r*® Vi)] + (veci^Wf[(FWF^lFf))®L^](veci^W)}+cW, 
2=1 

where is the n x d matrix satisfying vecR^^^ = r2~*'*Vec(l^ — XB^^^) and is an 
irrelevant constant. Based on the Kronecker identities (vecA)^vec B = ii(A^B) and 
vec{CDE) = (E^ ® C)vec{D), the surrogate can be rewritten as 


+ c 


^(F I F(')) = "2 ® V)] + tr(RW^L^RWFWFriFf)} 

2=1 
-l m 

= -- {tr[Ll-W(F, ® Vi)] + tr(Ff RW'^L^RWfWF r^)} + 


(d 


dd 


The first trace is linear in Fj with the coefficient of entry {Ti)jk equal to 

tr(n-“>V) = lJ(V0S2jf)l„. 

where is the (j, /c)-th nxn block of The matrix Mi of these coefficients can be 


written as 


M, = 


/l^ 0'^ 
0^ 


0^\ 
o' 


fv ... VA 


Vv, ... vj 


o 


yo'^ 0 ^ ... i^y 

= (Id 0 iA^[(idiI ® V) © ri-W](/rf © V). 


fi 0 ... o\ 
0 1 ... 0 

Vo 0 ... 1) 
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input : Y, X, 

output; MLE B, Fi,..., F^ 

1 Initialize Fj^°^ positive definite, i = 1,... ,m 

2 repeat 

3 I 

4 ^ argmin^ [vecY — {Id ® X)vecS]'^r2“^*^[vecl^ — {Id ® X)vecS] 

5 ^ reshape(f2“^*Vec(l^ — X, n, d) 

6 for z = 1,..., m do 

7 Cholesky ^ {Id 0 ln)^[(l4lj ® 0 0 In) 

8 Ff+^^ ^ ^(Ff 

9 end 

10 until objective value converges 


Algorithm 3: The MM algorithm for MLE of the multivariate response model (13). 


The directional derivative of ^'(F | F*-*^) with respect to Fj in the direction Aj is 

-itr(M,A,) + itr(FWi^«^V,i^WFf Fr^A^Fri) 

= -Mr(M,A,) + Mr(F^lFfi^«^V,i^WFf Fr^A,). 

Because all directional derivatives of ^'(F | F^*^) vanish at a stationarity point, the matrix 
equation 

Mi = F^lFfi^W^V^i^WFfF^l (15) 

holds. Fortunately, this equation admits an explicit solution. For positive scalers a and b, the 
solution to the equation b = x~^ax~^ is a: = ±y^ajb. The matrix analogue of this equation 
is the Riccati equation B = X~^AX~^ , whose solution is summarized in the next lemma. 

Lemma 6. Assume A and B are positive definite and L is the Cholesky factor of B. 
Then Y = L-^{L^ALf/^L-^ is the unique positive definite solution to the matrix equation 
B = X-^AX-\ 

Proof. Direct substitution shows that Y solves the equivalent equation XBX = A. To 
show uniqueness, suppose Y~^AY~^ = B and Z~^AZ~^ = B. The equations 

{B^/^YB^/^f = B^^^YBYB^!^ = B^^^AB^/^ 

{B^I^ZB^I^f = B^I^ZBZB^I^ = B^I’^AB^I’^ 

imply B^I'^YB^I'^ = B^/^ZB^/^ by virtue of the uniqueness of symmetric square root. Since 
is positive definite, Y = Z. □ 
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The Cholesky factor L in Lemma can be replaced by the symmetric square root of B. 
The solution, which is unique, remains the same. The Cholesky decomposition is preferred 
for its cheaper computational cost and better numerical stability. 

Algorithm [^summarizes the MM algorithm for htting the multi-response model (j^. Each 
iteration invokes m Cholesky decompositions and symmetric square roots oi d x d positive 
dehnite matrices. Fortunately in most applications, d is a small number. The following 
result guarantees the non-singularity of the Cholesky factor throughout iterations. 

Proposition 2. Assume Vi has strictly positive diagonal entries. Then the symmetric matrix 
M, = {u ® ® VO 0 01 n) is positive definite for all t. Furthermore if 

0 0 and no column of lies in the null space of V for all t, then >- 0 for all t. 

Proof. If Vi has strictly positive diagonal entries, then so does 1^1 J 0 V, and the Hadamard 
product (IrflJ 0 Vi) 0 is positive dehnite by Schur’s lemma. Since the matrix 0 
has full column rank d, the matrix Mi is also positive dehnite. Finally, if no column of 
lies in the null space of V, and P*-*^ is positive dehne, then is positive 

dehnite. The second claim follows by induction and Lemma □ 


Multivariate response, two variance components 

When there are m = 2 variance components f2 = P 10 V 1 -I-P 20 V 2 , repeated inversion of the 
nd X nd covariance matrix 17 reduces to a single nd x nd simultaneous congruence decom¬ 
position and, per iteration, two d x d Cholesky decompositions and one d x d simultaneous 
congruence decomposition. The simultaneous congruence decomposition of the matrix pair 
(V, V 2 ) involves generalized eigenvalues d = [di,..., dn) and a nonsingular matrix U such 
that U'^ViU = D = diag(<i) and U^VzU = I. If the simultaneous congruence decomposi¬ 
tion of (pP,pJ^) is with _ diag(AW) and 

then 


Q(t) ^ ($-W + 

Q-{t) = ($W 0C7)(AW 0i7 + /rf0 0C/)^ 

detl7(*) = det(AW 0i7 + /rf0/,,)det[($-W 0C/~^)^($-W 0C/~^)] 

= det(AW 0i7 + /rf0 V)det(pW 0 V) 

= det(AW 0 i7 -F/rf 0 V) det(P^*y det(V2)‘^. 


Updating the hxed ehects reduces to a weighted least squares problem for the transformed 
responses Y = U'^Y, transformed predictor matrix X = and observation weights 

{X^l^di + 1)“^. Algorithm 0 summarizes the simplihed MM algorithm. Detailed derivations 
are relegated to Section ?? of the Supplementary Materials. 
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Input : Y, X, Vi, V 2 
Output; MLE B, Fi, r 2 

1 Simultaneous congruence decomposition; {D,U) (Vi, V2) 

2 Transform data; Y ^ U^Y, X ^ U^X 

3 Initialize F® positive definite 

4 repeat 

5 

6 argmin^ [vec(X$^*^) — ^ X)vec-B]^(A‘'*^ (S) D + 

/„)-i[vec(:F$W) - (g) X)vecB] 

Cholesky ^ ^('Miag (tr {D{XfD + , fc = 1,..., d) 

Cholesky LfLf^ ^ $('Miag (^tr {{Xf D + , fc = 1,..., d) 

N2'’ ^ [(^ - 0 (dA^*)^ + 


Ff+i) ^ Mf '^i = 1,2 


12 until objective value converges 


Algorithm 4: MM algorithm for multivariate response model f2 = Fi0Vi + F20V2 
with two variance components matrices. Note that 0 denotes a Hadamard quotient. 


6.2 Multivariate response model with missing responses 


In many applications the multivariate response model (13) involves missing responses. For 


instance, in testing multiple longitudinal traits in genetics, some trait values yij may be 
missing due to dropped patient visits, while their genetic covariates are complete. Missing 


data destroys the symmetry of the log-likelihood (13) and complicates finding the MLE. 


(16) 


Fortunately, MM algorithm easily adapts to this challenge. 

The familiar EM argument ( |McLachlan and Krishnaii 2008, Section 2.2) shows that 

- ^IndetflW _ ltr{fI-W[vec(ZW - XBW)vec(ZW - XB^^Y + 

minorizes the observed log-likelihood at the current iterate {B^^\ F^*\ ..., F^^). Here Z*-b is 
the completed response matrix given the observed responses Y^ and the current parameter 
values. The complete data Y is assumed to be normally distributed A(vec(X.B‘^*)), 17*-*^). 
The block matrix is 0 except for a lower-right block consisting of a Schur complement. 
To maximize the surrogate (16), we invoke the familiar minorization ([^ and majorization 


(14) to separate the variance components Fj. At each iteration we impute missing entries by 


their conditional means, compute their conditional variances and covariances to supply the 
Schur complement, and then update the fixed effects and variance components by the explicit 
updates of Algorithm The required conditional means and conditional variances can be 
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conveniently obtained in the process of inverting by the sweep operator of computational 
statistics (Lange, 2010, Section 7.3). 


6.3 Linear mixed model (LMM) 


The linear mixed model plays a central role in longitudinal data analysis. For the sake of 
simplicity, consider the single-level LMM (Laird and Ware, 1982, Bates and Pinheiro, 1998) 
for n independent data clusters (yj, Xj, Zi) with 


Yi = Xil3 + + Si, i = 1,..., n, 

where /3 is a vector of fixed effects, the 7 ^ ~ Ri{6)) are independent random effects, and 

e, ~ 77(0, u'^Ini) captures random noise independent of 7 ^. We assume the matrices Zi have 
full column rank. The within-cluster covariance matrices Ri{9) depend on a parameter vector 
0; typical choices for Ri{9) impose autocorrelation, compound symmetry, or unstructured 
correlation. It is clear that Y) is normal with mean Xi/3, covariance fij = ZiRi[9) 
and log-likelihood 

Li{f3,9,a‘^) = -^lndetni-^{yi-Xif3)'^ni^{yi-Xif3). 

The next three facts about pseudo-inverses are used in deriving the MM algorithm for LMM. 

Lemma 7. If A has full column rank and B has full row rank, then {AB)~^ = B~^A~^. 

Proof. Under the hypotheses, the representations A^ = {A'^A)'^A'^ = {A^A)~^A'^ and 
= B'^{BB^)~^ are well known. The choice B^A^ = B'^{BB'^)~^{A^A)~^ A^ satisfies 
the four equations characterizing the pseudo-inverse of AB. □ 

Lemma 8. If A and B are positive semidefinite matrices with the same range, then 

\im {B + eI){A + eI)-\B + el) = BA B. 

e4.0 

Proof. Suppose A has spectral decomposition The matrix P = 

projects onto the range of A and therefore also projects onto the range of B. It follows that 

PB = B and by symmetry that BP = B. This allows us to write 

{B + eI){A + eI)-\B + eI) 

= BP{A + el) ^PB + eBP{A + e/)"^ + e{A + eI)-^PB + e^(A + el)~\ 


The last three of these terms vanish as e j, 0; the first term tends to the claimed limit. These 
assertions follow from the expressions 

P(A + eI)-^P = P(A + eI)-^ = (A + eI)-^P = ^ j^Uiuf 

Ai>0 T e 

and e 2 (A + e/)-^ = ^. □ 
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Lemma 9. If R and S are positive definite matrices, and the conformable matrix Z has full 
column rank, then the matrices ZRZ^ and ZSZ'^ share a common range. 

Proof. In fact, both matrices have range equal to the range of Z. The matrices Z and ZR}!"^ 
clearly have the same range. Furthermore, the matrices ZRf^'^ and ZR}I'^R}I‘^Z'^ also have 
the same range. □ 

The convexity of the map {X,Y) i—>■ X^Y~^X and Lemmas and now yield via 
the obvious limiting argument the majorization 




^ (Z,i^,(0W)Zf)(Z,i^i(0)Zf)+(Z,i^,(0W)Z^ + 


a 


m 


a 

4(t) 


2 


= [Z,i^,(0W)Zfzf+]i^^l(0)[Z+Z,i^,(0W)Zn + 

U - 

In combination with the minorization ([^, this gives the surrogate 


2 ^ ^ ^ 2a2 


(y, - - X,/3W) + S, 


for the log-likelihood Li{6,a^), where 


rf = (z+z.H.(eW)zf )nr<‘>(y, - jc,/3«)) = fl.(eW)zfnr<')(y. - jc,/? 


-w 


id)'' 


The parameters 0 and cx^ are nicely separated. To maximize the overall minorization function 

we update via 


^2(4+1) ^ ^2{t) 


\ 


E.iVi - Xi/3'‘'>)mp^‘\yi - X,l3‘-‘> 


^.tr(ri 




For structured models such as autocorrelation and compound symmetry, updating 0 is a low¬ 
dimensional optimization problem that can be approached through the stationarity condition 


for each component 6j. For the unstructured model with RfiO) = R for all i, the stationarity 
condition reads 


^zfrif'z. 



R-^ 
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and admits an explicit solntion based on Lemma 

Similar tactics apply to a mnltilevel LMM ([Bates and Pinheiro, 1998) with responses 




e,-. 


Minorization separates parameters for each level (variance component). Depending on the 
complexity of the covariance matrices, maximization of the snrrogate can be accomplished 
analytically. For the sake of brevity, details are omitted. 


6.4 MAP estimation 

Snppose (3 follows an improper flat prior, the variance components aj follow inverse gamma 
priors with shapes aj > 0 and scales 7 ^ > 0, and these priors are independent. The log- 
posterior density then rednces to 


ln/(/3,cr2|^,X) 

^ - mm 

--Indet n- -{y~ X(3)^n~\y - X(3) - + 1 ) Incx^ _ ^ ^ ^ j'ly) 

i=l i=l ^ 


where c is an irrelevant constant. The MAP estimator of (/3, cr^) is the mode of the posterior 
distribntion. The npdate (|^ of /3 given cr^ remains the same. To npdate cr^ given /3, apply 
the same minorizations ([^ and ([^ to the hrst hrst two terms of eqnation ( jl^ . This separates 
parameters and yields a convex snrrogate for each af. The minimnm of the af snrrogate is 
dehned by the stationarity condition 

4(9 


0 = -^tr(p-WL^) + 


+ 1 


2a f a. 


~ 4 * 

cf 


Mnltiplying this by af gives a qnadratic eqnation in af. The positive root shonld be taken 
to meet the nonnegativity constraint on af. 


For the mnltivariate response model (13), we assnme the variance components Fj follow 


independent inverse Wishart distribntions with degrees of freedom Ui > d — 1 and scale 
matrix >- 0 . The log density of the posterior distribntion is 


L{B,T\X,Y) = -^lndetn-^vec{Y - XB)^n-\ec{Y - XB) 

^ m 1 ^ 

“2 + ^ + “ 2 + 

i=l i=l 


(18) 


where c is an irrelevant constant. Invoking the minorizations and ( [l4j ) for the hrst two 
terms and the snpporting hyperplane minorization 


-IndetTi > -Indetrf-tr{rrW(ri-rf^)} 
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for — In det F, gives the surrogate function 

in m 

9 (r|r(‘>) = 


1=1 


1 


El"' + ‘i+ i)tr(rr'‘’r,) - i E“'(*-rr‘) + 


i=l 


2 = 1 
1 
2 


hi) 


2=1 


The optimal Fj satishes the stationarity condition 

{Id ® InfiiUld ® o ® In) + {ly^ + d+ l)FrW 

= Fri(Ff + «'i)F-^ 

and can be found using Lemma 

6.5 Variable selection 

In the statistical analysis of high-dimensional data, the imposition of sparsity leads to bet¬ 
ter interpretation and more stable parameter estimation. MM algorithms mesh well with 
penalized estimation. The simple variance components model ([^ illustrates this fact. For 
the selection of hxed effects, minimizing the lasso-penalized log-likelihood 

-L{P,cr^) + xY,\M 


is often recommended (Schelldorfer et ah, 2011). The only change to the MM Algorithm 


is that in estimating f3, one solves a lasso penalized general least squares problem rather 
than an ordinary general least squares problem. The updates of the variance components 
af remain the same. For selection among a large number of variance components, one can 
minimize the ridge-penalized log-likelihood 

m 

—L(/3, cr^) -h A ^ (t1 
2=1 

subject to the nonnegativity constraints af > 0. Here the standard deviations (Ji are the 
underlying parameters. The variance update (1^ becomes 


^2(t+l) ^ ^ 2 (t) 


\y - X( 3 d))TQ,-d)v^Q,-d){^y _ X(3 
tr(ri-Wv) + 2A 




i = 1 ,..., m. 


The updates for the fixed effects (3 are unaffected. Equation (19) clearly exhibits shrinkage 
but no thresholding. The lasso penalized log-likelihood 


- L(/3, (t‘^) + A E <Ti 


(19) 


2=1 
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subject to nonnegativity constraint Uj > 0 achieves both ends. The update of Uj is chosen 
among the positive roots of a quartic equation and the boundary 0 , whichever yields a lower 
objective value. 


6.6 Beyond the linear model 

One can extend the MM algorithms to binary and discrete response data with the frame¬ 
work of generalized estimating equations (GEE) ( [Liang and Zeger 1986). Again consider n 
independent data clusters {yi,Xi). In longitudinal studies, {yi,Xi) would be the responses 
and clinical covariates of subject i at different time points. In genetic studies, {yi, Xi) would 
be the trait values and covariates of individuals within family i. GEE captures the within- 
cluster correlation by specifying the first two moments of the conditional distribution of yi 
given Xj, namely /i^ = E{iiij\xij) and = Var(?/jj|a;ij). If one assumes that yij follows an 
exponential family with canonical link, then 

= y{9ij) and ctJ(/3) = i = 1,... ,n, j = 1,... ,ni, 

where y{t) is a differentiable canonical link function, y'{t) is its hrst derivative, 6ij = xjjf3 
is the linear systematic part of yij associated with the covariates, 0 is an over-dispersion 
parameter, and (3 is the vector of fixed effects. 

The GEE estimator of f3 solves the equation 


^(a,/3)[?/i-Mi(/3)] = Op, 


i=l 


where yi = {yn,... ^yimY, Hi{f3) = [yiiif3),...,Vi = cov{yi) is the working co- 
variance matrix of the i-th subject, and dy,i{(3) is the differential of y-i{f3). In longitudinal 
studies, Vi is often parameterized as Vi{cx.,f3) = Ay‘^{f3)Ri{a.)Ay‘^{f3), where is 

a diagonal matrix with standard deviations aij along its diagonal, and R{a.) is a corre¬ 
lation matrix with parameters ck. This parameterization is too restrictive in many other 
applications. For instance, in genetic studies, it is critical to dissect the variance into dif¬ 


ferent sources such as additive, dominance, and household environment (Lange, 2002). This 
suggests the variance component parameterization 


= al2^i + alAyj + alHi + allni, al + a^ + al + al = 1 , 


where in the i-th. family is the theoretical kinship matrix, A 7 j is the dominance variance 
matrix, and Hi is the household indicator matrix. The matrices $j, Ajj, and Hi are 
correlation matrices, and the simplex constraint ensures Ri is as well. In general, the variance 
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component parameterization Ri{(j‘^) = XlJLi the simplex constraint in force is 

reasonable. In this setting the GEE update of (3 given solves the equation 

n 

Y,XlA-'/\l3)R-'{^^)A-'/^l3)lyt-lim] = 0 ,, 

i=l 

This is just the classical GEE update. The difficulty lies in updating cr^ given (3. We propose 
minimizing the sum where 'ip{t) is a scalar convex loss function. Example loss 

functions include the Mahalanobis criterion 

n n 

i=l i=l 

and the sum of squared Frobenius distances 


i=l 










‘)A‘ 


1/2 



The convexity of '0i(t) entails a minorization similar to the minorization (|^. Minimizing 
the surrogate then yields the MM update 


a 


2 ( 1 + 1 ) _ 


Sfci(». - ' it, ‘(<t==<'I)A,. ' (y, - pj) 

E™.. YTUvi - - fi.) 


Under 'ip 2 {t) the MM update boils down to projection onto the simplex. Further exploration 
of these ideas probably deserves another paper and will be omitted here for the sake of 
brevity. 


7 A numerical example 

Quantitative trait loci (QTL) mapping aims to identify genes associated with a quantitative 
trait. Gurrent sequencing technology measures millions of genetic markers in study subjects. 
Traditional single-marker tests suffer from low power due to the low frequency of many 
markers and the corrections needed for multiple hypothesis testing. Region-based association 
tests are a powerful alternative for analyzing next generation sequencing data with abundant 
rare variants. 

Suppose ^ is a n X 1 vector of quantitative trait measurements on n people, X is an 
n X p predictor matrix (incorporating predictors such as sex, smoking history, and principal 
components for ethnic admixture), and G is an u x m genotype matrix of m genetic variants 
in a pre-dehned region. The linear mixed model assumes 

Y = X/3 + G 7 + e, 7~7V(o,a2/), e^N{0,alR), 
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where (3 are fixed effects, 7 are random genetic effects, and cr| and al are variance components 
for the genetic and environmental effects, respectively. Thus, the phenotype vector Y has 
covariance agGG^ + a‘^In, where GG^ is the kernel matrix capturing the overall effect of the 
m variants. Current approaches test the null hypothesis = 0 for each region separately 


and then adjust for multiple testing (Lee et ah, 2014). Hence, we consider the joint model 


y 

li 


— Xf3 + + ■ ■ ■ + 

N{0,aU), er~.N{0,a%) 


and select the variance components af via the penalization (19). Here Sj is the number of 


_^ /2 

variants in region i, and the weights ' put all variance components on the same scale. 
We illustrate this approach using the COPDGene exome sequencing study (http: //www. 


copdgene.org/) (Regan et ah, 2010). After quality control, 399 individuals and 646,125 
genetic variants remain for analysis. Genetic variants are grouped into 16,619 genes to 
expose those genes associated with the complex trait height. We include age, sex, and the 
top 3 principal components in the mean effects. Because the number of genes vastly exceeds 
the sample size n = 399, we first pare the 16,619 genes down to 200 genes according to 
their marginal likelihood ratio test p-values and then carry out penalized estimation of the 


200 variance components in the joint model (19). This is similar to the sure independence 


screening strategy for selecting mean effects (Fan and Lv, 2008). Genes are ranked according 


to the order they appear in the lasso solution path. Table lists the top 10 genes together 
with their marginal LRT p-values. Figure displays the corresponding segment of the 
lasso solution path. It is noteworthy that the ranking of genes by penalized estimation 
differs from the ranking according to marginal p-values. The same phenomenon occurs 
in selection of highly correlated mean predictors. This penalization approach for selecting 
variance components warrants further theoretical study. It is reassuring that the simple MM 
algorithm scales to high-dimensional problems. 


8 Discussion 

The current paper leverages the MM principle to design powerful and versatile algorithms for 
variance components estimation. The MM algorithms derived are notable for their simplicity, 
generality, numerical efficiency, and theoretical guarantees. Both ordinary MLE and REML 


are apt to benefit. Other extensions are possible. In nonlinear models (Bates and Watts 


1988, Lindstrom and Bates, 1990), the mean response is a nonlinear function in the fixed 


effects (3. One can easily modify the MM algorithms to update /3 by a few rounds of Gauss- 
Newton iteration. The variance components updates remain unchanged. 
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Lasso Rank 

Gene 

Marginal P-value 

^ Variants 

1 

DOLPPl 

2.35 X 10-*^ 

2 

2 

C9orf21 

3.70 X 10-5 

4 

3 

PLSl 

2.29 X 10-5 

5 

4 

ATP5D 

6.80 X lO-’^ 

3 

5 

ADCY4 

1.01 X 10-5 

11 

6 

SLC22A25 

3.95 X 10-5 

14 

7 

RCSDl 

9.04 X 10-^ 

4 

8 

PCDH7 

1.20 X 10-^ 

7 

9 

AVIL 

8.34 X 10-^ 

11 

10 

AHR 

1.14 X 10-5 

7 


Table 4: Top 10 genes selected by the lasso penalized variance component model (19) 
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association study of 200 genes and the complex trait height. 
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Figure 2: Solution path of the lasso penalized variance component model (19) in an associ¬ 


ation study of 200 genes and the complex trait height. 
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One can also extend our MM algorithms to elliptically symmetric densities 

g-|K(52) 

(27r)t (det r2)5 

defined for y G M"', where 5'^ = {y — iJ,)'^fl~^{y — y,) denotes the Mahalanobis distance be¬ 
tween y and y. Here we assume that the function k{s) is strictly increasing and strictly con¬ 
cave. Examples of elliptically symmetric densities include the multivariate t, slash, contami¬ 


nated normal, power exponential, and stable families. Previous work (Huber and Ronchetti 


2009, Lange and Sinsheimer, 1993) has focused on using the MM principle to convert param¬ 


eter estimation for these robust families into parameter estimation under the multivariate 
normal. One can chain the relevant majorization k{s) < -|- with our 

previous minorizations and simultaneously split variance components and pass to the more 
benign setting of the multivariate normal. 
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